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Abstract 

The theory of large deviations deals with the probabilities of rare 
events (or fluctuations) that are exponentially small as a function of 
some parameter, e.g., the number of random components of a system, 
the time over which a stochastic system is observed, the amplitude 
of the noise perturbing a dynamical system or the temperature of 
a chemical reaction. The theory has applications in many different 
scientific fields, ranging from queuing theory to statistics and from 
finance to engineering. It is also increasingly used in statistical physics 
for studying both equilibrium and noncquilibrium systems. In this 
context, deep analogies can be made between familiar concepts of 
statistical physics, such as the entropy and the free energy, and concepts 
of large deviation theory having more technical names, such as the rate 
function and the scaled cumulant generating function. 

The first part of these notes introduces the basic elements of large 
deviation theory at a level appropriate for advanced undergraduate and 
graduate students in physics, engineering, chemistry, and mathematics. 
The focus there is on the simple but powerful ideas behind large devia- 
tion theory, stated in non-technical terms, and on the application of 
these ideas in simple stochastic processes, such as sums of independent 
and identically distributed random variables and Markov processes. 
Some physical applications of these processes are covered in exercises 
contained at the end of each section. 

In the second part, the problem of numerically evaluating large 
deviation probabilities is treated at a basic level. The fundamental idea 
of importance sampling is introduced there together with its sister idea, 
the exponential change of measure. Other numerical methods based on 
sample means and generating functions, with applications to Markov 
processes, are also covered. 



'Published in R. Leidl and A. K. Hartmann (eds). Modem Computational Science 11: 
Lecture Notes from the 3rd International Oldenburg Summer School, BIS-Verlag der Carl 
von Ossietzky Universitat Oldenburg, 2011. 
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1 Introduction 



The goal of these lecture notes, as the title says, is to give a basic introduction 
to the theory of large deviations at three levels: theory, applications and 
simulations. The notes follow closely my recent review paper on large 
deviations and their applications in statistical mechanics |48j . but are, in a 
way, both narrower and wider in scope than this review paper. 

They are narrower, in the sense that the mathematical notations have 
been cut down to a minimum in order for the theory to be understood by 
advanced undergraduate and graduate students in science and engineering, 
having a basic background in probability theory and stochastic processes 
(see, e.g., [27]). The simplification of the mathematics amounts essentially 
to two things: i) to focus on random variables taking values in M or M^, 
and ii) to state all the results in terms of probability densities, and so 
to assume that probability densities always exist, if only in a weak sense. 
These simplifications are justified for most applications and are convenient 
for conveying the essential ideas of the theory in a clear way, without the 
hindrance of technical notations. 

These notes also go beyond the review paper [48j, in that they cover 
subjects not contained in that paper, in particular the subject of numerical 
estimation or simulation of large deviation probabilities. This is an important 
subject that I intend to cover in more depth in the future. 

Sections |4] and [5] of these notes are a first and somewhat brief attempt in 
this direction. Far from covering all the methods that have been developed 
to simulate large deviations and rare events, they concentrate on the central 
idea of large deviation simulations, namely that of exponential change of 
measure, and they elaborate from there on certain simulation techniques that 
are easily applicable to sums of independent random variables, Markov chains, 
stochastic differential equations and continuous-time Markov processes in 
general. 

Many of these applications are covered in the exercises contained at the 
end of each section. The level of difficulty of these exercises is quite varied: 
some are there to practice the material presented, while others go beyond 
that material and may take hours, if not days or weeks, to solve completely. 
For convenience, I have rated them according to Knuth's logarithmic scalej^ 

In closing this introduction, let me emphasize again that these notes are 
not meant to be complete in any way. For one thing, they lack the rigorous 

^00 = Immediate; 10 = Simple; 20 = Medium; 30 = Moderately hard; 40 = Term 
project; 50 = Research problem. See the superscript attached to each exercise. 
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notation needed for handling large deviations in a precise mathematical way, 
and only give a hint of the vast subject that is large deviation and rare event 
simulation. On the simulation side, they also skip over the subject of error 
estimates, which would fill an entire section in itself. In spite of this, I hope 
that the material covered will have the effect of enticing readers to learn more 
about large deviations and of conveying a sense of the wide applicability, 
depth and beauty of this subject, both at the theoretical and computational 
levels. 



2 Basic elements of large deviation theory 
2.1 Examples of large deviations 

We start our study of large deviation theory by considering a sum of real 
random variables (RV for short) having the form 



1 " 

5„ = -Vx,. (2.1) 

T7. '—^ 



n 
1=1 



Such a sum is often referred to in mathematics or statistics as a sample 
mean. We are interested in computing the probability density function 
VSn (-5) (pdf for short) of Sn in the simple case where the n RVs are mutually 
independent and identically distributed (IID for short) This means 
that the joint pdf of . . . , X„, factorizes as follows: 

n 

p(Xi,...,X„) = J];p(Xi), (2.2) 

4 = 1 

with piXi) a fixed pdf for each of the ^i'sj^ 
We compare next two cases for 'piXi): 

• Gaussian pdf: 

V{x) = ^^e-^^-'^)'/^^.^) ^ ^ ^ ^2.3) 
where fx = E[X] is the mean oi X and cr^ = E[{X — fi)^] its variance]^ 



The pdf of Sn will be denoted by Ps„(s) or more simply by p{Sn) when no confusion 



arises 

3 



To avoid confusion, we should write pxi{^), but in order not to overload the notation, 
we will simply write p{Xi). The context should make clear what RV we are referring to. 

*The symbol E[X] denotes the expectation or expected value of X, which is often 
denoted by (X) in physics. 
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Exponential pdf: 



= -e"^/^, xG[0,oo), (2.4) 



with mean E[X] = /i > 0. 
What is the form of ps„ (s) for each pdf? 

To find out, we write the pdf associated with the event Sn = shy summing 
the pdf of all the values or realizations (xi, . . . , x„) G M" of Xi, . . . , Xn 
such that Sn = sj^ In terms of Dirac's delta function 6(x), this is written as 

P,.(.)=/d..../d.„*(Et...-».)p(.... ...„). (2.5) 

JM. Jm. 

From this expression, we can then obtain an explicit expression for ps„{s) by 



using the method of generating functions (see Exercise 2.7.1) or by substi- 
tuting the Fourier representation of 5{x) above and by explicitly evaluating 
the n + 1 resulting integrals. The result obtained for both the Gaussian and 
the exponential densities has the general form 

PsAs)^e--'^^\ (2.6) 

where 



for the Gaussian pdf, whereas 



I(.) = ^::_^, .eM (2.7) 



/(s) = - - 1 -In-, s>0 (2.8) 

for the exponential pdf. 

We will come to understand the meaning of the approximation sign (~) 
more clearly in the next subsection. For now we just take it as meaning that 
the dominant behaviour of p{Sn) as a function of n is a decaying exponential 
in n. Other terms in n that may appear in the exact expression of p{Sn) are 
sub-exponential in n. 

The picture of the behaviour of p{Sn) that we obtain from the result 



of (2.6) is that p{Sn) decays to exponentially fast with n for all values 



Sn = s for which the function I{s), which controls the rate of decay of 



^Whenever possible, random variables will be denoted by uppercase letters, while their 
values or realizations will be denoted by lowercase letters. This follows the convention used 
in probability theory. Thus, we will write X = a; to mean that the RV X takes the value x. 
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« — n=100 






Figure 2.1: (Left) pdf ps„(s) of the Gaussian sample mean Sn ioi fj, = a = 1. 
(Right) In{s) = — ^ lnp5',j(s) for different values of n demonstrating a rapid 
convergence towards the rate function I{s) (dashed line). 



p{Sn), is positive. But notice that I{s) > for both the Gaussian and 
exponential densities, and that I{s) = in both cases only for s = /x = 
Therefore, since the pdf of Sn is normalized, it must get more and more 



concentrated around the value s = ^ as n — t- oo (see Fig. 2.1), which means 
that ps„{s) 6{s — n) in this limit. We say in this case that Sn converges 
in probability or in density to its mean. 

As a variation of the Gaussian and exponential samples means, let us 
now consider a sum of discrete random variables having a probability 
distribution P{Xi) instead of continuous random variables with a pdf p(Xj)j^ 
To be specific, consider the case of IID Bernoulli RVs Xi, . . . , X„ taking 
values in the set {0,1} with probability P{Xi = 0) = 1— a and P{Xi = 1) = a. 
What is the form of the probability distribution P{Sn) of 5„ in this case? 

Notice that we are speaking of a probability distribution because Sn is 
now a discrete variable taking values in the set {0, 1/n, 2/n, . . . , (n — l)/n, 1}. 
In the previous Gaussian and exponential examples, Sn was a continuous 
variable characterized by its pdf p{Sn)- 

With this in mind, we can obtain the exact expression of P{Sn) using 



methods similar to those used to obtain p{Sn) (see Exercise 2.7.4). The 
result is different from that found for the Gaussian and exponential densities, 
but what is remarkable is that the distribution of the Bernoulli sample mean 



® Following the convention in probability theory, we use the lowercase p for continuous 
probability densities and the uppercase P for discrete probability distributions and proba- 
bility assignments in general. Moreover, following the notation used before, we will denote 
the distribution of a discrete Sn by fs„(s) or simply P(Sn)- 
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also contains a dominant exponential term having the form 



-nl{s) 



(2. 



where I{s) is now given by 



/(s) =sln- + (l-s)ln^ -, sG[0,l]. 

a 1 — a 



(2.10) 



The behaviour of the exact expression of P{Sn) as n grows is shown in 



Fig. 2.2 together with the plot of I{s) as given by Eq. (2.10). Notice how 
P{Sn) concentrates around its mean fi = E[Xi] = a as a result of the fact 
that I{s) > and that s = fi is the only value of Sn for which / = 0. Notice 
also how the support of P{Sn) becomes "denser" as n — )■ oo, and compare 
this property with the fact that I{s) is a continuous function despite Sn being 
discrete. Should I{s) not be defined for discrete values if Sn is a discrete 
RV? We address this question next. 



2.2 The large deviation principle 

The general exponential form e~"^^*^ that we found for our three previous 
sample means (Gaussian, exponential and Bernoulli) is the founding result or 
property of large deviation theory, referred to as the large deviation principle. 
The reason why a whole theory can be built on such a seemingly simple 
result is that it arises in the context of many stochastic processes, not just 
IID sample means, as we will come to see in the next sections, and as can be 
seen from other contributions to this volume; see, e.g., Engel'sj^ 

The rigorous, mathematical definition of the large deviation principle 
involves many concepts of topology and measure theory that are too technical 
to be presented here (see Sec. 3.1 and Appendix B of [48j). For simplicity, 
we will say here that a random variable Sn or its pdf p{Sn) satisfies a large 
deviation principle (LDP) if the following limit exists: 

hm --lnp5„(s) =/(s) (2.11) 

n— >oo TL 

and gives rise to a function I{s), called the rate function, which is not 
everywhere zero. 

The relation between this definition and the approximation notation used 
earlier should be clear: the fact that the behaviour of ps„{s) is dominated 

'^The volume referred to is the volume of lecture notes produced for the summer school; 
see http : / /www . mcs . uni-oldenburg . de/ 
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Figure 2.2: (Top left) Discrete probability distribution Ps„{s) of the Bernoulli 
sample mean for a = 0.4 and different values of n. (Top right) Finite-n rate 
function In{s) = — MnPs^(s). The rate function /(s) is the dashed line. 
(Bottom left). Coarse-grained pdf ps„is) for the Bernoulli sample mean. 
(Bottom right) I„(s) = — ^ liip5„(s) as obtained from the coarse-grained pdf. 



for large n by a decaying exponential means that the exact pdf of Sn can be 
written as 

P5„(s) =e-"^W+°('^) (2.12) 

where o{n) stands for any correction term that is sub-linear in n. Taking the 
large deviation limit of (2.11) then yields 

lim lnps„(s) = - lim = /(s), (2.13) 

n— >oo Ti n— >oo n 



since o{n)/n — t- 0. We see therefore that the large deviation limit of Eq. ( 2.11 ) 
is the limit needed to retain the dominant exponential term in p{Sn) while 
discarding any other sub-exponential terms. For this reason, large deviation 
theory is often said to be concerned with estimates of probabilities on the 
logarithmic scale. 
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This point is illustrated in Fig. 



2.2 



There we see that the function 



/„(s) = --lnps„(s) (2.14) 
n 

is not quite equal to the limiting rate function /(s), because of terms of order 
o{n)/n = 0(1); however, it does converge to I{s) as n gets larger. Plotting 
p{Sn) on this scale therefore reveals the rate function in the limit of large 
n. This convergence will be encountered repeatedly in the sections on the 
numerical evaluation of large deviation probabilities. 

It should be emphasized again that the definition of the LDP given above 
is a simplification of the rigorous definition used in mathematics, due to 
the mathematician Srinivas Varadhanj^ The real, mathematical definition 
is expressed in terms of probability measures of certain sets rather than in 
terms of probability densities, and involves upper and lower bounds on these 
probabilities rather than a simple limit (see Sec. 3.1 and Appendix B of 
|48] ) . The mathematical definition also applies a priori to any RVs, not just 
continuous RVs with a pdf. 

In these notes, we will simplify the mathematics by assuming that the 
random variables or stochastic processes that we study have a pdf. In fact, 
we will often assume that pdfs exist even for RVs that are not continuous 
but "look" continuous at some scale. 

To illustrate this point, consider again the Bernoulli sample mean. We 
noticed that 5„ in this case is not a continuous RV, and so does not have 
a pdf. However, we also noticed that the values that Sn can take become 
dense in the interval [0, 1] as n — )■ 00, which means that the support of the 
discrete probability distribution P(5„) becomes dense in this limit, as shown 
in Fig. |2.2[ From a practical point of view, it therefore makes sense to treat 
Sn as a continuous RV for large n by interpolating P{Sn) to a continuous 
function p{Sn) representing the "probability density" of Sn- 

This pdf is obtained in general simply by considering the probability 
P{Sn G [sjS + As]) that Sn takes a value in a tiny interval surrounding 
the value s, and by then dividing this probability by the "size" As of that 
interval: 

P(S„€[.,>- + A,1) 

As 

The pdf obtained in this way is referred to as a smoothed, coarse-grained 

* Recipient of the 2007 Abel Prize for his "fundamental contributions to probability 
theory and in particular for creating a unified theory of large deviations." 
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if the spacing As between the values of Sn is chosen to be 1/n. 

This process of replacing a discrete variable by a continuous one in some 
limit or at some scale is known in physics as the continuous or continuum 
limit or as the macroscopic limit. In mathematics, this limit is expressed 
via the notion of weak convergence (see |16j). 

2.3 The Gartner-Ellis Theorem 

Our goal in the next sections will be to study random variables and stochastic 
processes that satisfy an LDP and to find analytical as well as numerical ways 
to obtain the corresponding rate function. There are many ways whereby a 
random variable, say Sn, can be shown to satisfy an LDP: 

• Direct method: Find the expression of p{Sn) and show that it has 
the form of the LDP; 

• Indirect method: Calculate certain functions of Sn, such as generat- 
ing functions, whose properties can be used to infer that Sn satisfies 
an LDP; 

• Contraction method: Relate Sn to another random variable, say 
An, which is known to satisfy an LDP and derive from this an LDP 



We have used the first method when discussing the Gaussian, exponential 
and Bernoulli sample means. 

The main result of large deviation theory that we will use in these 
notes to obtain LDPs along the indirect method is called the Gartner-Ehis 
Theorem (GE Theorem for short), and is based on the calculation of the 
following function: 



To be more precise, we should make clear that the coarse-grained pdf depends on the 
spacing As by writing, say, ps„,As(s). However, to keep the notation to a minimum, we 
will use the same lowercase p to denote a coarse-grained pdf and a "true" contirmous pdf. 
The context should make clear which of the two is used. 



for Sn- 




n— >oo n 



(2.17) 
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known as the scaled cumulant generating functior^'^ (SCGF for short). 
In this expression E[-] denotes the expected value, A; is a real parameter, and 
Sn is an arbitrary RV; it is not necessarily an IID sample mean or even a 
sample mean. 

The point of the GE Theorem is to be able to calculate X{k) without 
knowing p{Sn)- We will see later that this is possible. Given A(A;), the GE 
Theorem then says that, if A(A;) is differentiablej^ then 

• Sn satisfies an LDP, i.e., 

lira --Inps^is) = I{s); (2.18) 

n— 5>oo n 

• The rate function I{s) is given by the Legendre-Fenchel transform 

of A(A;): 

I{s) = sup{ks - X{k)}, (2.19) 



where "sup" stands for the supremum 
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We will come to calculate X{k) and its Legendre-Fenchel transform for 
specific RVs in the next section. It will also be seen in there that X{k) 
does not always exist (this typically happens when the pdf of a random 
variable does not admit an LDP). Some exercises at the end of this section 
illustrate many useful properties of X{k) when this function exists and is 
twice differentiable. 



2.4 Varadhan's Theorem 

The rigorous proof of the GE Theorem is too technical to be presented here. 
However, there is a way to justify this result by deriving in a heuristic way 
another result known as Varadhan's Theorem. 

The latter theorem is concerned with the evaluation of a functional^ 
expectation of the form 

Wn[f] = i?[e"^(^")] = / psAs) e"^(^) ds, (2.20) 

^°The function E[e^'^] for k real is known as the generating function of the RV X; 
lni?[e*'^] is known as the log-generating function or cumulant generating function. 

The word "scaled" comes from the extra factor 1/n. 

^^The statement of the GE Theorem given here is a simplified version of the full result, 
which is essentially the result proved by Gartner |22j . See Sec. 3.3 of [48.] for a more 
complete presentation of GE Theorem and [IS] for a rigorous account of it. 

^■^In these notes, a sup can be taken to mean the same as a max. 

^^A function of a function is called a functional. 
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where / is some function of Sn, which is taken to be a real RV for simplicity. 
Assuming that Sn satisfies an LDP with rate function /(s), we can write 

Wnif]- [ e"[-^(^)-^(^)] ds. (2.21) 

with sub-exponential corrections in n. This integral has the form of a so- 
called Laplace integral, which is known to be dominated for large n by 
its largest integrand when it is unique. Assuming this is the case, we can 
proceed to approximate the whole integral as 

Wnif] ~ e"'"P-[-^(")-^(')l. (2.22) 

Such an approximation is referred to as a Laplace approximation, the 
Laplace principle or a saddle-point approximation (see Chap. 6 of 
fS]), and is justified in the context of large deviation theory because the 
corrections to this approximation are sub-exponential in n, as are those of 
the LDP. By defining the following functional: 

A[/] = lim -InWnif) (2.23) 

n— >oo n 

using a limit similar to the limit defining the LDP, we then obtain 

X[f]=sup{f{s)-I{s)}. (2.24) 



The result above is what is referred to as Varadhan Theorem 
The contribution of Varadhan was to prove this result for a large class of RVs, 
which includes not just IID sample means but also random vectors and even 
random functions, and to rigorously handle all the heuristic approximations 
used above. As such, Varadhan's Theorem can be considered as a rigorous 
and general expression of the Laplace principle. 

To connect Varadhan's Theorem with the GE Theorem, consider the 



special case f{s) = ks with k G MF^ Then Eq (|2.24|) becomes 



X{k) = sup{ks- I{s)}, (2.25) 



where A(fc) is the same function as the one defined in Eq. (2.17). Thus we see 



that if Sn satisfies an LDP with rate function I{s), then the SCGF \{k) of 



^^Varadhan's Theorem holds in its original form for bounded functions /, but another 
stronger version can be applied to unbounded functions and in particular to f{x) = kx; 
see Theorem 1.3.4 of \M- 
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Sn is the Legendre-Fenchel transform of I{s). This in a sense is the inverse 
of the GE Theorem, so an obvious question is, can we invert the equation 
above to obtain I(s) in terms of A(/c)? The answer provided by the GE 
Theorem is that, a sufficient condition for I{s) to be the Legendre-Fenchel 
transform of X{k) is for the latter function to be differentiable^^ 

This is the closest that we can get to the GE Theorem without proving it. 
It is important to note, to fully appreciate the importance of that theorem, 
that it is actually more just than an inversion of Varadhan's Theorem because 
the existence of A(A:) implies the existence of an LDP for Sn- In our heuristic 
inversion of Varadhan's Theorem we assumed that an LDP exists. 



2.5 The contraction principle 

We mentioned before that LDPs can be derived by a contraction method. 
The basis of this method is the following: let An be a random variable 
An known to have an LDP with rate function and consider another 

random variable Bn which is a function of the first Bn = /(j4„). We want 
to know whether Bn satisfies an LDP and, if so, to find its rate function. 
To find the answer, write the pdf of Bn in terms of the pdf of An- 

PbM= [ PAAa)da (2.26) 

Jia:f(a)=b\ 



and use the LDP for A„ to write 



PBjb)^ [ e-"^^(")da. (2.27) 

J{a:f{a)=b} 

Then apply the Laplace principle to approximate the integral above by its 
largest term, which corresponds to the minimum of 1(a) for a such that 
b = f{a). Therefore, 



PbM « exp -n inf /^(a) . (2.28) 

V W-f{a)=b} J 

This shows that p{Bn) also satisfies an LDP with rate function /_b(6) given 
by 

lB{b) = ^ inf lA{a). (2.29) 

{a:f{a)=b} 



^^The differentiability condition, though sufficient, is not necessary: /(s) in some cases 
can be the Legendre-Fenchel transform of \{k) even when the latter is not differentiable; 
see Sec. 4.4 of gS]. 



13 



This formula is called the contraction principle because / can be 
many-to-one, i.e., there might be many a's such that 6 = /(a), in which case 
we are "contracting" information about the rate function of An down to Bn- 
In physical terms, this formula is interpreted by saying that an improbable 



fluctuatior of Bn is brought about by the most probable of all improbable 
fluctuations of An- 

The contraction principle has many applications in statistical physics. 
In particular, the maximum entropy and minimum free energy principles, 
which are used to find equilibrium states in the microcanonical and canonical 
ensembles, respectively, can be seen as deriving from the contraction principle 
(see Sec. 5 of 



2.6 From small to large deviations 

An LDP for a random variable, say Sn again, gives us a lot of information 
about its pdf. First, we know that p{Sn) concentrates on certain points 
corresponding to the zeros of the rate function I{s). These points correspond 
to the most probable or typical values of S„ in the limit n — )• cxd and can 
be shown to be related mathematically to the Law of Large Numbers 
(LLN for short). In fact, an LDP always implies some form of LLN (see 
Sec. 3.5.7 of HE]). 

Often it is not enough to know that Sn converges in probability to some 
values; we may also want to determine the likelihood that Sn takes a value 
away but close to its typical value(s). Consider one such typical value s* and 
assume that I{s) admits a Taylor series around s*: 

I(s) = I{s*) + I'{s*){s - s*) + ^^^{s - s*f + ■■■ . (2.30) 

Since s* must correspond to a zero of I{s), the first two terms in this series 
vanish, and so we are left with the prediction that the small deviations of 
Sn around its typical value are Gaussian-distributed: 

P5„(s)«e-'^^"(^*)(^-^*)'/2. (2.31) 

In this sense, large deviation theory contains the Central Limit The- 
orem (see Sec. 3.5.8 of |48|). At the same time, large deviation theory 
can be seen as an extension of the Central Limit Theorem because it gives 
information not only about the small deviations of Sn, but also about its 

'^^In statistical physics, a deviation of a random variable away from its typical value is 
referred to as a fluctuation. 
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large deviations far away from its typical value(s); hence the name of the 
theoryp] 



2.7 Exercises 

2.7.lM^^ (Generating functions) Let An be a sum of n IID RVs: 



An = ^Xi. (2.32) 



1=1 



Show that the generating function of A„, defined as 

WaM = Eie^"^-], (2.33) 
satisfies the following factorization property: 

n 

WAAk) = ll^xAk) = WxAkT, (2.34) 
1=1 

Wxi{k) being the generating function of Xi. 

2.7.2. [-'-'^] (Gaussian sample mean) Find the expression of Wx{k) for X 
distributed according to a Gaussian pdf with mean ^ and variance cj^ . From 
this result, find Ws„{k) following the previous exercise and p{Sn) by inverse 
Laplace transform. 

2.7.3. [-'-^] (Exponential sample mean) Repeat the previous exercise for the ex- 
ponential pdf shown in Eq. (2.4). Obtain from your result the approximation 
shown in (2.8). 

2.7. 4^^^ (Bernoulli sample mean) Show that the probability distribution of 
the Bernoulli sample mean is the binomial distribution: 

= ( ^ 1. - (2.35) 

(sn)![(l — sjnj! 



Use Stirling's approximation to put this result in the form of (2.9) with I{s) 
given by Eq. (2.10). 

2.7.5. ["""^l (Multinomial distribution) Repeat the previous exercise for IID 
RVs taking values in the set {0, 1, . . . , g — 1} instead of {0, 1}. Use Stirling's 
approximation to arrive at an exponential approximation similar to (2.9). 
(See solution in [20p 



'^A large deviation is also called a large fluctuation or a rare and extreme event. 



15 



2.7.6. [^°] (SCGF at the origin) Let A(A;) be the SCGF of an IID sample 
mean S'„. Prove the foUowing properties of A(A;) at /c = 0: 

• A(0) = 

. A'(0) = E[Sn] = E[X,] 

• A"(0) = var(Xi) 

Which of these properties remain vahd if Sn is not an IID sample mean, but 
is some general RV? 

2.7.7. [-'-^] (Convexity of SCGFs) Show that A(A;) is a convex function of k. 

2.7.8. [-'-^] (Legendre transform) Show that, when X{k) is everywhere dif- 
ferentiable and is strictly convex (i.e., has no affine or linear parts), the 



Legendre-Fenchel transform shown in Eq. (2.19) reduces to the Legendre 
transform: 

I{s) = k{s)s- X{k{s)), (2.36) 

where k{s) is the unique of X'{k) = s. Explain why the latter equation has a 
unique root. 

2.7.9. '^°^ (Convexity of rate functions) Prove that rate functions obtained 
from the GE Theorem are strictly convex. 

2.7.10. [''"^] (Varadhan's Theorem) Verify Varadhan's Theorem for the Gaus- 
sian and exponential sample means by explicitly calculating the Legendre- 
Fenchel transform of I{s) obtained for these sample means. Compare your 
results with the expressions of A(A;) obtained from its definition. 



3 Applications of large deviation theory 

We study in this section examples of random variables and stochastic processes 
with interesting large deviation properties. We start by revisiting the simplest 
application of large deviation theory, namely, IID sample means, and then 
move to Markov processes, which are often used for modeling physical and 
man-made systems. More information about applications of large deviation 
theory in statistical physics can be found in Sees 5 and 6 of |48j . as well as 
in the contribution of Engel contained in this volume. 
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3.1 IID sample means 

Consider again the sample mean 



5, 



1 " 

'n = -VXi (3.1) 



n 

i=l 



involving n real IID variables Xi, . . . , Xn with common pdf p{x). To de- 
termine whether Sn satisfies an LDP and obtain its rate function, we can 
follow the GE Theorem and calculate the SCGF A(A;) defined in Eq. (2.17). 
Because of the IID nature of Sm A(fc) takes a simple form: 

\{k) = \iiE[e^^]. (3.2) 

In this expression, X is any of the Xj's (remember they are IID). Thus to 
obtain an LDP for 5„, we only have to calculate the log-generating function 
of X, check that it is differentiable and, if so, calculate its Legendre-Fenchel 



transform (or in this case, its Legendre transform; see Exercise 2.7.8). Exer- 
cise 3.6.1 considers different distributions p{x) for which this calculation can 
be carried out, including the Gaussian, exponential and Bernoulli distribu- 
tions studied before. 

As an extra exercise, let us attempt to find the rate function of a special 
sample mean defined by 

1 " 

i=l 

for a sequence Xi , . . . , X^ of n discrete IID RVs with finite state space 
X = {1, 2, . . . , q}. This sample mean is called the empirical distribution 
of the Xj's as it "counts" the number of times the value or symbol j £ X 
appears in a given realization of Xi, . . . , X„. This number is normalized by 
the total number n of RVs, so what we have is the empirical frequency 
for the appearance of the symbol j in realizations of Xi, . . . , X„. 

The values of Lnj for all j £ X can be put into a vector L„, called the 
empirical vector. For the Bernoulli sample mean, for example, X = {0, 1} 
and so is a two-dimensional vector L„ = (Ln,o, Ln,i) containing the 
empirical frequency of O's and I's appearing in a realization xi, . . . ,Xn of 

Xl, . . . , Xn- 

# O's in Xl, . . . ,x„ # I's in Xl, . . . ,Xn 

^n,0 = , ^n,l = • [oA) 

n n 

To find the rate function associated with the random vector L„, we apply 
the GE Theorem but adapt it to the case of random vectors by replacing k 
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in X{k) by a vector k having the same dimension as Ln* The calculation of 
A(k) is left as an exercise (see Exercise |3.6.8 ). The result is 



A(k) = In ^ Pj (3.5) 

where Pj = P{Xi = j). It is easily checked that A(k) is differentiable in the 
vector sense, so we can use the GE Theorem to conclude that L„ satisfies an 
LDP, which we write as 

p(L„ = /x)p.e-"^('^), (3.6) 
with a vector rate function /(/x) given by the Legendre-Fenchel transform of 



A(k). The calculation of this transform is the subject of Exercise 3.6.8 The 
final result is 

/(,.) = 5^/., In ^. (3.7) 

This rate function is called the relative entropy or Kullback-Leibler 
divergence [8]. The full LDP for L„ is referred to as Sanov's Theorem 
(see im and Sec. 4.2 of 08]). 

It can be checked that /(/x) is a convex function of /x and that it has a 
unique minimum and zero at fi = P, i.e., = Pj for all j G X. As seen 
before, this property is an expression of the LLN, which says here that the 
empirical frequencies Lnj converge to the probabilities Pj as n — t- oo. The 
LDP goes beyond this result by describing the fluctuations of L„ around the 
typical value /x = P. 



3.2 Markov chains 

Instead of assuming that the sample mean Sn arises from a sum of IID RVs 
Xi, . . . , Xn, consider the case where the Xj's form a Markov chain. This 
means that the joint pdf p{Xi, . . . , Xn) has the form 

71-1 

p{Xi, . . . ,Xn) = p{Xi)ll7T{X,+i\Xi), (3.8) 

i=l 

where p{Xi) is some initial pdf for Xi and 7r(Xj+i|Xj) is the transition 
probability density that Xj+i follows Xi in the Markov sequence Xi — >• 
X2 — 7- • • • — )• Xn (see |27j for background information on Markov chains). 

The GE Theorem can still be applied in this case, but the expression of 
X{k) is more complicated. Skipping the details of the calculation (see Sec. 4.3 
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of |48j). we arrive at the following result: provided that the Markov chain is 
homogeneous and ergodic (see for a definition of ergodicity), the SCGF 
of Sn is given by 

X{k)=lnC{Uk), (3.9) 

where Ci^k) is the dominant eigenvalue (i.e., with largest magnitude) of 
the matrix 11^ whose elements 7ffc(x,x') are given by TTk{x,x') = TT{x'\x)e^^ . 
We call the matrix 11^ the tilted matrix associated with Sn- If the Markov 
chain is finite, it can be proved furthermore that A(/c) is analytic and so 
differentiable. From the GE Theorem, we then conclude that Sn has an LDP 
with rate function 

/(s) = sup{A;s-lnC(nfc)}- (3.10) 



Some care must be taken if the Markov chain has an infinite number of states. 
In this case, \{k) is not necessarily analytic and may not even exist (see, e.g., 
[291 [301 EH] for illustrations). 

An application of the above result for a simple Bernoulli Markov chain 



is considered in Exercise 3.6.10 In Exercises 3.6.12 and 3.6.13, the case of 



random variables having the form 

^ n— 1 

Qn = -y^Q{xi,Xi+i) (3.11) 



is covered. This type of RV arises in statistical physics when studying 
currents in stochastic models of interacting particles (see [30]). The tilted 
matrix Ilk associated with Q„ has the form Tfk{x,x') = TT{x'\x)e'"^^^'^ \ 



3.3 Continuous-time Markov processes 

The preceding subsection can be generalized to ergodic Markov processes 
evolving continuously in time by taking a continuous time limit of discrete- 
time ergodic Markov chains. 

To illustrate this limit, consider an ergodic continuous-time process 
described by the state X{t) for < t < T. For an infinitesimal time-step At, 
time-discretize this process into a sequence of n+1 states Xq, Xi, . . . , Xn with 
n = T/ At and Xi = X{iAt), z = 0, 1, . . . , n. The sequence Xo,Xi, . . . , Xn is 
an ergodic Markov chain with infinitesimal transition matrix n(At) given by 

n(At) = 7r(x(t + At)\x{t)) = e^^^ = 1 + GAt + o{At), (3.12) 

where I is the identity matrix and G the generator of X{t). With this 
discretization, it is now possible to study LDPs for processes involving X{t) 
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by studying these processes in discrete time at the level of the Markov chain 
Xq,Xi, . . . , Xn, and transfer these LDPs into continuous time by taking the 
limits Ai — )• 0, n — )• oo. 

As a general application of this procedure, consider a so-called additive 
process defined by 

1 



St = j; (3-13) 
The discrete-time version of this RV is the sample mean 

i=0 1=0 

From this association, we find that the SCGF of St, defined by the limit 
A(A:) = lim ;^lnS[e™^^], k €R, (3.15) 

is given by 

A(A:) = lim lim ^ln^[e''^*^^=o^»] (3.16) 

at the level of the discrete-time Markov chain. 

According to the previous subsection, the limit above reduces to In ({Ilk{At)), 
where 11^ (At) is the matrix (or operator) corresponding to 

nfc(At) = e'^^*^'n(At) 

= (1 + kx'At + o(At)) (/ + G'At + o(At)) 

= I + GkAt + o{At), (3.17) 

where 

Gk = G + kx'6:^^^>. (3.18) 
For the continuous-time process X{t), we must therefore have 

Xik) = aGk), (3.19) 

where C{Gk) is the dominant eigenvalue of the tilted generator Gfcj^ Note 
that the reason why the logarithmic does not appear in the expression of 
X{k) for the continuous-time process is because C is now the dominant 

'^^As for Markov chains, this result holds for continuous, ergodic processes with finite 
state-space. For infinite-state or continuous-space processes, a similar result holds provided 
Gk has an isolated dom inan t eigenvalue. 

^'^ Compare with Eq. K^. 
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eigenvalue of the generator of the infinitesimal transition matrix, which is 
itself the exponential of the generator. 

With the knowledge of X{k), we can obtain an LDP for St using the GE 
Theorem: If the dominant eigenvalue is differentiable in k, then St satisfies 
an LDP in the long-time limit, T — )• c«, which we write as PSri^) ^ e~"^^^*\ 
where I(s) is the Legendre-Fenchel transform A(A;). Some applications of 
this result are presented in Exercises 3.6.11| and |3.6.13[ Note that in the case 



of current-type processes having the form of Eq. (3.11 ) , the ti lted generator 
is not simply given by Gfc = G + kx'6x,x'', see Exercise 



3.6.13 



3.4 Paths large deviations 

We complete our tour of mathematical applications of large deviation theory 
by studying a different large deviation limit, namely, the low- noise limit of 
the following stochastic differential equation (SDE for short): 

xit) = f{x{t)) + V^at), x(0) = 0, (3.20) 

which involves a force f{x) and a Gaussian white noise (^(t) with the 
properties E[^{t)] = and E[^{t')^{t)] = 6{t' - t); see Chap. XVI of [19] for 
background information on SDEs. 

We are interested in studying for this process the pdf of a given random 
path {x{t)}f^Q of duration T in the limit where the noise power e vanishes. 
This abstract pdf can be defined heuristically using path integral methods 
(see Sec. 6.1 of [48])- In the following, we denote this pdf by the functional 
notation p[x] as a shorthand for p({x(t)}^o)- 

The idea behind seeing the low-noise limit as a large deviation limit 
is that, as e — )• 0, the random path arising from the SDE above should 
converge in probability to the deterministic path x{t) solving the ordinary 
differential equation 

±it) = f{x{t)), x{0) = 0. (3.21) 

This convergence is a LLN-type result, and so in the spirit of large deviation 
theory we are interested in quantifying the likelihood that a random path 
{x{t)}J^Q ventures away from the deterministic path in the limit e — )• 0. The 
functional LDP that characterizes these path fluctuations has the form 

p[x] ^ e-^[^l/^ (3.22) 

where 

I[x]= rm-fixm'dt. (3.23) 
Jo 
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See [21] and Sec. 6.1 of [H] for historical sources on this LDP. 

The rate functional I[x] is called the action, Lagrangian or entropy 
of the path {x(t)}^Q. The names "action" and "Lagrangian" come from 
an analogy with the action of quantum trajectories in the path integral 
approach of quantum mechanics (see Sec. 6.1 of [H]). There is also a close 
analogy between the low-noise limit of SDEs and the semi-classical or WKB 
approximation of quantum mechanics [48j . 

The path LDP above can be generalized to higher-dimensional SDEs 
as well as SDEs involving state-dependent noise and correlated noises (see 
Sec. 6.1 of [48j). In all cases, the minimum and zero of the rate functional 
is the trajectory of the deterministic system obtained in the zero-noise 
limit. This is verified for the ID system considered above: I[x] > for all 



trajectories and I[x] = for the unique trajectory solving Eq. (3.21). 

Functional LDPs are the most refined LDPs that can be derived for SDEs 
as they characterize the probability of complete trajectories. Other "coarser" 
LDPs can be derived from these by contraction. For example, we might be 
interested to determine the pdf p{x, T) of the state x{T) reached after a 
time T. The contraction in this case is obvious: p{x, T) must have the large 
deviation form p{x^ T) ~ with 

Vix,T)= inf I\x\. (3.24) 

x{t):x{0)=0,x{T)=x 

That is, the probability of reaching x{T) = x from x(0) = is determined by 
the path connecting these two endpoints having the largest probability. We 
call this path the optimal path, maximum likelihood path or instanton. 
Using variational calculus techniques, often used in classical mechanics, it 
can be proved that this path satisfies an Euler-Lagrange-type equation as 
well as a Hamilton-type equation (see Sec. 6.1 of [48]). Applications of these 
equations are covered in Exercises [3.6.14 -3.6.17 



Quantities similar to the additive process St considered in the previous 



subsection can also be defined for SDEs (see Exercises 3.6.11 and 3.6.12). 
An interesting aspect of these quantities is that their LDPs can involve the 
noise power e if the low-noise limit is taken, as well as the integration time 
T, which arises because of the additive (in the sense of sample mean) nature 
of these quantities. In this case, the limit T — )• must be taken before the 
low- noise limit e — )• 0. If e — )• is taken first, the system considered is no 
longer random, which means that there can be no LDP in time. 
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3.5 Physical applications 

Some applications of the large deviation results seen so far are covered in the 
contribution of Engel to this volume. The following list gives an indication 
of these applications and some key references for learning more about them. 
A more complete presentation of the applications of large deviation theory 
in statistical physics can be found in Sees. 5 and 6 of 



Equilibrium systems: Equilibrium statistical mechanics, as embod- 
ied by the ensemble theory of Boltzmann and Gibbs, can be seen with 
hindsight as a large deviation theory of many-body systems at equilib- 
rium. This becomes evident by realizing that the thermodynamic limit 
is a large deviation limit, that the entropy is the equivalent of a rate 
function and the free energy the equivalent of a SCGF. Moreover, the 
Legendre transform of thermodynamics connecting the entropy and 
free energy is nothing but the Legendre-Fenchel transform connecting 
the rate function and the SCGF in the GE Theorem and in Varad- 
han's Theorem. For a more complete explanation of these analogies, 
and historical sources on the development of large deviation theory in 
relation to equilibrium statistical mechanics, see the book by Ellis |19| 
and Sec. 3.7 of 



Chaotic systems and multifractals: The so-called thermody- 
namic formalism of dynamical systems, developed by Ruelle [40j and 
others, can also be re-interpreted with hindsight as an application of 



large deviation theory for the study of chaotic systems^*^ There are 
two quantities in this theory playing the role of the SCGF, namely, the 
topological pressure and the structure function. The Legendre 
transform appearing in this theory is also an analogue of the one en- 
countered in large deviation theory. References to these analogies can 
be found in Sees. 7.1 and 7.2 of 



Nonequilibrium systems: Large deviation theory is becoming the 
standard formalism used in studies of nonequilibrium systems modelled 
by SDEs and Markov processes in general. In fact, large deviation 
theory is currently experiencing a sort of revival in physics and mathe- 
matics as a result of its growing application in nonequilibrium statistical 
mechanics. Many LDPs have been derived in this context: LDPs for 
the current fluctuations or the occupation of interacting particle models. 



^°There is a reference to this connection in a note of Ruelle's popular book, Chance and 
Chaos [39] (see Note 2 of Chap. 19). 
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such as the exclusion process, the zero-range process (see [28]) and 
their many variants [45j, as weh as LDPs for work- related and entropy- 
production-related quantities for nonequilibrium systems modelled with 
SDEs [6, ,:.7j . Good entry points in this vast field of research are [13j 
and l30!- 



Fluctuation relations: Several LDPs have come to be studied in 
recent years under the name of fiuctuation relations. To illustrate 
these results, consider the additive process St considered earlier and 
assume that this process admits an LDP with rate function I{s). In 
many cases, it is interesting not only to know that St admits an LDP 
but to know how probable positive fiuctuations of St are compared to 
negative fluctuations. For this purpose, it is common to study the ratio 

PSt{s) 



which reduces to 



(3.25) 

PSt[-s) 

^iy(-s)-2(s,i (3^26) 



PSri^) ^T[Ii-s)-I{s)] 



if we assume an LDP for St- In many cases, the difference I{—s) — I{s) 
is linear in s, and one then says that St satisfies a conventional 
fluctuation relation, whereas if it nonlinear in s, then one says 
that St satisfies an extended fluctuation relation. Other types of 
fluctuation relations have come to be defined in addition to these; for 
more information, the reader is referred to Sec. 6.3 of [48]. A list of 
the many physical systems for which fluctuation relations have been 
derived or observed can also be found in this reference. 

3.6 Exercises 

3.6. iJ"*"^' (IID sample means) Use the GE Theorem to find the rate function 
of the IID sample mean Sn for the following probability distribution and 
densities: 

• Gaussian: 

• Bernoulh: Xi £ {0, 1}, P{Xi = 0) = 1 - a, P{Xi = 1) = a. 

• Exponential: 

p(x) = -e-''/>', xG[0,oo), fi>0. (3.28) 
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• Uniform: 

U) = 11 xe[0,a] 
10 otherwise. 



p(x) = -—^ — ^, X G M, cr > 0. (3.30) 



• Cauchy: 

3.6.2.[-'-^] (Nonconvex rate functions) Rate functions obtained from the GE 
Theorem are necessarily convex (strictly convex, in fact; see Exercise 2.7.9), 
but rate functions in general need not be convex. Consider, as an example, 
the process defined by 



Sn = Y + 



-VXi, (3.31) 

n ^ — ^ 



n 

i=l 

where Y = ±1 with probability ^ and the Xj's are Gaussian IID RVs. Find 
the rate function for 5.„ assuming that Y is independent of the Xj's. Then find 
the corresponding SCGF. What is the relation between the Legendre-Fenchel 
transform of X{k) and the rate function /(s)? How is the nonconvexity of 
I{s) related to the differentiability of A(A;)? (See Example 4.7 of [48j for the 
solution.) 

3.6.3. t-*-^] (Exponential mixture) Repeat the previous exercise by replacing 
the Bernoulli Y with Y = Z/n, where Z is an exponential RV with mean 1. 
(See Example 4.8 of [H] for the solution.) 

3.6.4. l-'-'^] (Product process) Find the rate function of 

/ n \ I/J^ 

^n=(n^O (3.32) 

where Xi G {1, 2} with p{Xi = 1) = a and p{Xi = 2) = 1 - a, < a < 1. 

3.6.5. t-'-^] (Self-process) Consider a sequence of IID Gaussian RVs Xi, . . . , Xn- 
Find the rate function of the so-called self-process defined by 

Sn = --lnp{Xi,...,Xn). (3.33) 

n 

(See Sec. 4.5 of [Mj for information about this process.) Repeat the problem 
for other choices of pdf for the RVs. 
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3.6.6. (Iterated large deviations [35]) Consider the sample mean 

^ m 

^-." = -E^^"^ (3.34) 

i=l 

involving m IID copies of a random variable X^'^\ Show that if X*^") satisfies 
an LDP of the form 

p^,„)(x)«e-"^(-), (3.35) 
then 5m, n satisfies an LDP having the form 

Ps™,„(s)«e— '^^(-). (3.36) 

3.6.7. (Persistent random walk) Use the result of the previous exercise 
to find the rate function of 



1 " 

n = -Vx„ (3.37) 
n ^-^ 



where the Xj's are independent Bernoulli RVs with non- identical distribution 
P{Xi = 0) = a' and P{Xi = 1) = 1 - a* with < a < 1. 

3.6.8. t-*-^] (Sanov's Theorem) Consider the empirical vector L„ with compo- 
nents defined in Eq. ( |3.3| ). Derive the expression found in Eq. (3.5) for 

A(k) = lim -ln^[e"'^-^"], (3.38) 

n— >oo n 

where k • L„ denotes the scalar product of k and L^,. Then obtain the rate 
function found in (3.7) by calculating the Legendre transform of A(k). Check 
explicitly that the rate function is convex and has a single minimum and 
zero. 

3.6.9. (Contraction principle) Repeat the first exercise of this section 
by using the contraction principle. That is, use the rate function of the 
empirical vector L„ to obtain the rate function of 5„. What is the mapping 
from L„ to 5„? Is this mapping many-to-one or one-to-one? 

3.6.10. ['^'''] (Bernoulli Markov chain) Consider the Bernoulli sample mean 

1 " 

Sn = -y^Xi, X,e{0,l}. (3.39) 
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Find the expression of A (A;) and I{s) for this process assuming that the Xj's 
form a Markov process with symmetric transition matrix 

7t{x'\x) = I = ^ (3.40) 

1^ Q X X 

with < Q < 1. (See Example 4.4 of [IS] for the solution.) 

3.6. 11. [^''l (Residence time) Consider a Markov process with state space 
A' = {0, 1} and generator 

G=f-" (3.41) 



Find for this process the rate function of the random variable 



1 



T 



Lt = ^J^ 6xit)fldt, (3.42) 

which represents the fraction of time the state X{t) of the Markov chain 
spends in the state X = over a period of time T. 

3.6. 12. (Current fluctuations in discrete time) Consider the Markov chain 
of Exercise |3.6.10[ Find the rate function of the mean current Qn defined 
as 

1 " 

Qn = -y^J{xi+i,Xi), (3.43) 

= (344) 

Qn represents the mean number of jumps between the states and 1: at each 
transition of the Markov chain, the current is incremented by 1 whenever a 
jump between the states and 1 occurs. 

3.6. 13. f^^^ (Current fluctuations in continuous time) Repeat the previous 
exercise for the Markov process of Exercise 3.6. 11| and the current Qt defined 

as 

^ 71—1 

= lim - V/(xi+i,Xi), (3.45) 
i=0 

where Xi = x{iAt) ad f{x',x) as above, i.e., f{xi+i,Xi) = 1 — 5x^.x^+i- Show 
for this process that the tilted generator is 
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3.6.14. [-'-'^l (Ornstein-Uhlenbeck process) Consider the following linear SDE: 

x{t) = --fx{t) + V^m, (3.47) 

often referred to as the Langevin equation or Ornstein-Uhlenbeck pro- 
cess. Find for this SDE the solution of the optimal path connecting the 
initial point x{0) = with the fluctuation x{T) = x. From the solution, find 
the pdf p{x, T) as well as the stationary pdf obtained in the limit T — t- oo. 
Assume e — )• throughout but then verify that the results obtained are valid 
for aU e > 0. 

3.6.15. [^°] (Conservative system) Show by large deviation techniques that 
the stationary pdf of the SDE 

i(t) = - V[/(x(t)) + ^ (3.48) 

admits the LDP p{x) ~ with rate function V{x.) = 2C/(x). The rate 

function V{x.) is called the quasi-potential. 

3.6. 16. ["^^J (Transversal force) Show that the LDP found in the previous 
exercise also holds for the SDE 

±{t) = -VUix{t)) + A(x) + ^/e e(t) (3.49) 

if V?7 . A = 0. (See Sec. 4.3 of [21] for the solution.) 

3.6.17. ['^^1 (Noisy Van der Pol oscillator) The following set of SDEs describes 
a noisy version of the well-known Van der Pol equation: 

X = V 

V = -X + v{a - x'^ - v^) + \/e (3.50) 

The parameter a G M controls a bifurcation of the zero-noise system: for 
a < 0, the system with ^ = has a single attracting point at the origin, 
whereas for a > 0, it has a stable limit cycle centered on the origin. Show 
that the stationary distribution of the noisy oscillator has the large deviation 
form p{x^ v) ~ Q-U{x,v)/£ g _^ Q ^^\^\^ j-ate function 

U{x, v) = -a{x^ + v"^) + + v"^?- (3.51) 

Find a different set of SDEs that has the same rate function. (See [26J for 
the solution.) 
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S-G-ISJ^''^ (Dragged Brownian particle |50) ) The stochastic dynamics of a 
tiny glass bead immersed in water and pulled with laser tweezers at constant 
velocity can be modelled in the overdamped limit by the following reduced 
Langevin equation: 

X = -{x{t) - ut) + ^/iC(^), (3.52) 

where x{t) is the position of the glass bead, u the pulling velocity, ^(t) a 
Gaussian white noise modeling the influence of the surrounding fluid, and e 
the noise power related to the temperature of the fluid. Show that the mean 
work done by the laser as it pulls the glass bead over a time T, which is 
defined as ^ 

VFt = -^ / {x{t)-vt)dt, (3.53) 
^ Jo 

satisfies an LDP in the limit T — )• oo. Derive this LDP by assuming first that 
e — )• 0, and then derive the LDP without this assumption. Finally, determine 
whether Wt satisfies a conventional or extended fluctuation relation, and 
study the limit — t- 0. (See Example 6.9 of |48j for more references on this 
model.) 

4 Numerical estimation of large deviation proba- 
bilities 

The previous sections might give the false impression that rate functions 
can be calculated explicitly for many stochastic processes. In fact, exact 
and explicit expressions of rate functions can be found only in few simple 
cases. For most stochastic processes of scientific interest (e.g,, noisy nonlinear 
dynamical systems, chemical reactions, queues, etc.), we have to rely on 
analytical approximations and numerical methods to evaluate rate functions. 

The rest of these notes attempts to give an overview of some of these 
numerical methods and to illustrate them with the simple processes (IID 
sample means and Markov processes) treated before. The goal is to give a 
flavor of the general ideas behind large deviation simulations rather than 
a complete survey, so only a subset of the many methods that have come 
to be devised for numerically estimating rate functions are covered in what 
follows. Other methods not treated here, such as the transition path method 
discussed by Dellago in this volume, are mentioned at the end of the next 
section. 
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4.1 Direct sampling 

The problem addressed in this section is to obtain a numerical estimate of the 
pdf (s) for a real random variable Sn satisfying an LDP, and to extract 
from this an estimate of the rate function I(s)|^ To be general, we take Sn 
to be a function of n RVs Xi , . . . , X„ , which at this point are not necessarily 
IID. To simplify the notation, we will use the shorthand cj = Xi, . . . , X„. 
Thus we write 5„ as the function and denote by piu:) the joint pdf of 

the Xi's 
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Numerically, we cannot of course obtain ps„{s) or /(s) for all s G M, but 
only for a finite number of values s, which we take for simplicity to be equally 
spaced with a small step As. Following our discussion of the LDP, we thus 
attempt to estimate the coarse-grained pdf 

_ PiSnG[s,s + As]) _ PjSn e As) 
P3-^'>- As " As ' ^ ' 

where As denotes the small interval [s, s + As] anchored at the value sj^ 

To construct this estimate, we follow the statistical sampling or Monte 
Carlo method, which we broke down into the following steps (see the 
contribution of Katzgraber in this volume for more details) : 

1. Generate a sample of L copies or realizations of the sequence 
u from its pdf p{uj). 

2. Obtain from this sample a sample {s^^^j'^^ of values or realizations 
for Sn- 

s(^) = S„(cu(^')), j = l,...,L. (4.2) 

3. Estimate P{Sn G A^) by calculating the sample mean 

Pi(A,) = i^lA.(.(^)), (4.3) 
i=i 

where IaIx) denotes the indicator function for the set A, which is 
equal to 1 if x G A and otherwise. 

■^^The literature on large deviation simulation usually considers the estimation of P{Sn G 
A) for some set A rather than the estimation of the whole pdf of Sn- 

^■^For simplicity, we consider the Xi's to be real RVs. The case of discrete RVs follow 
with slight changes of notations. 

■^^ Though not explicitly noted, the coarse-grained pdf depends on As; see Footnote |9] 



30 



Figure 4.1: (Left) Naive sampling of the Gaussian IID sample mean (// = 
a = 1) for n = 10 and different sample sizes L. The dashed line is the exact 
rate function. As L grows, a larger range of I(s) is sampled. (Right) Naive 
sampling for a fixed sample size L = 10 000 and various values of n. As n 
increases, In,L{s) approaches the expected rate function but the sampling 
becomes inefficient as it becomes restricted to a narrow domain. 

4. Turn the estimator Pl{A.s) of the probability P{Sn € Ag) into an 
estimator Pl[s) of the probability density Ps„{s): 



The result of these steps is illustrated in Fig. 4.1 for the case of the IID 



Gaussian sample mea n. N ote that pl{s) above is nothing but an empirical 



vector for Sn (see Sec. 3.1) or a density histogram of the sample {s^^^}j'^i. 

The reason for choosing p(s) as our estimator of PSni^) is that it is an 
unbiased estimator in the sense that 

E[pl{s)]=PsAs) (4.5) 

for all L. Moreover, we know from the LLN that piis) converges in probability 
to its mean P5„(s) as L — )• oo. Therefore, the larger our sample, the closer 
we should get to a valid estimation of P5„(s). 

To extract a rate function I(s) from piis), we simply compute 

/n,L(s) = --lnj5i(s) (4.6) 
n 

and repeat the whole process for larger and larger integer values of n and L 
until In,L{s) converges to some desired level of accuracy 
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*Note that Pl(As) and pl{s) differ only by the factor As. /„,l(s) can therefore be 
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4.2 Importance sampling 

A basic rule of thumb in statistical sampling, suggested by the LLN, is that 
an event with probability P will appear in a sample of size L roughly LP 
times. Thus to get at least one instance of that event in the sample, we must 
have L > 1/P as an approximate lower bound for the size of our sample (see 



Exercise 4.7.5 for a more precise derivation of this estimate). 

Applying this result to PSnis), we see that, if this pdf satisfies an LDP 
of the form ps„is) ~ q-^-H'^) ^ then we need to have L > e'^^(^) to get at least 
one instance of the event Sn S Ag in our sample. In other words, our sample 
must be exponentially large with n in order to see any large deviations (see 



Fig. 4.1 and Exercise 4.7.2) 



This is a severe limitation of the sampling scheme outlined earlier, and 
we call it for this reason crude Monte Carlo or naive sampling. The 
way around this limitation is to use importance sampling (IS for short) 
which works basically as follows (see the contribution of Katzgraber for more 
details) : 

1. Instead of sampling the Xj's according to the joint pdi p(lj), sample 
them according to a new pdf q{uj) 
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2. Calculate instead oi pl{s) the estimator 

1 ^ 

Us) = -^5]lA.(5n(cc.(^)))/?(c^(^)), (4.7) 

where 

R{u:) = ^ (4.8) 

is the called the likelihood ratio. Mathematically, R also corresponds 
to the Radon-Nikodym derivative of the measures associated with 
p and gpl 

The new estimator qL{s) is also an unbiased estimator oi ps„{s) because 

Eg[qUs)]=Ep\pLis)]. (4.9) 



(see Exercise 4.7.3) 



computed from either estimator with a difference (In As)/n that vanishes as n — >■ oo. 

^^The pdf q must have a support at least as large as that of p, i.e., q{uj) > if p{lj) > 0, 
otherwise the ratio _R„ is ill-defined. In this case, we say that q is relatively continuous 
with respect to p and write q^ p. 

■^^The Radon-Nikodym derivative of two measures /i and u such that v ^ /i is defined 
as dfi/df. If these measures have densities p and q, respectively, then dp./dv — p/q. 



32 



However, there is a reason for choosing ^l(s) as our new estimator: we might 
be able to come up with a suitable choice for the pdf q such that ^l(s) has a 
smaller variance than pl{s)- This is in fact the goal of IS: select q{ijj) so as 
to minimize the variance 

var,(gi(s)) = E,[{qL{s) - v{s)f]. (4.10) 

If we can choose a q such that Ya.iq{qi{s)) < varp(pi(s)), then qiis) will 
converge faster to ps„is) than as we increase L. 



It can be proved (see Exercise 4.7.6) that in the class of all pdfs that 



are relatively continuous with respect to p there is a unique pdf q* that 
minimizes the variance above. This optimal IS pdf has the form 

q*{u) = =p(c^|5„(c^) = s) (4.11) 

and has the desired property that var^* (^^(s)) = 0. It does seem therefore 
that our sampling problem is solved, until one realizes that q* involves the 
unknown pdf that we want to estimate, namely, ps„is)- Consequently, q* 
cannot be used in practice as a sampling pdf. Other pdfs must be considered. 

The next subsections present a more practical sampling pdf, which can be 
proved to be optimal in the large deviation limit n — )• oo. We will not attempt 
to justify the form of this pdf nor will we prove its optimality. Rather, we will 
attempt to illustrate how it works in the simplest way possible by applying 
it to IID sample means and Markov processes. For a more complete and 
precise treatment of IS of large deviation probabilities, see jl] and Chap. VI 
of [2]. For background information on IS, see Katzgraber (this volume) and 
Chap. V of [2]. 

4.3 Exponential change of measure 

An important class of IS pdfs used for estimating ps„ (s) is given by 

gnfc5„(a;) 

where A; € M and Wnik) is a normalization factor given by 

Wn{k)=Ep[Q'^^^-]= f e"^'^"('^)j5(a;)da;. (4.13) 

We call this class or family of pdfs parameterized by k the exponential 
family. In large deviation theory, pk is also known as the tilted pdf 
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associated with p or the exponential twisting of p^j The Ukehhood ratio 
associated with this change of pdf is 

R{u) = e-"'='5"('^) Wn{k), (4.14) 
which, for the purpose of estimating /(s), can be approximated by 

R{u) ^ e-nlkS„(u')^m] (4.15) 

given the large deviation approximation Wn{k) ~ e"''^^'^''. 

It is important to note that a single pdf of the exponential family with 
parameter k cannot be used to efficiently sample the whole of the function 
Ps„is) but only a particular point of that pdf. More precisely, if we want to 
sample PSn{^) the value s, we must choose k such that 

Ep,[Ql{s)]=PsAs) (4.16) 
which is equivalent to solving A'(A;) = s for k, where 

A(A:) = lim - In ^„[e"'='^"] (4.17) 

n— ^oo Tl 



is the SCGF of 5^ defined with respect to p (see Exercise 4.7.7). Let 
us denote the solution of these equations by k{s). For many processes 
Sn, it can be proved that the tilted pdf Pk{s) corresponding to k{s) is 
asymptotically optimal in n, in the sense that the variance of qiis) under 
Pk{s) goes asymptotically to as n — t- oo. When this happens, the convergence 
of qiis) towards ps„is) is fast for increasing L and requires a sub-exponential 
number of samples to achieve a given accuracy for In, Lis)- To obtain the full 
rate function, we then simply need to repeat the sampling for other values of 
s, and so other values of k, and scale the whole process for larger values of n. 

In practice, it is often easier to reverse the roles of k and s in the sampling. 
That is, instead of fixing s and selecting k = k{s) for the numerical estimation 
of I{s), we can fix k to obtain the rate function at s{k) = A'(fc). This way, 
we can build a parametric representation of I{s) over a certain range of s 
values by covering or "scanning" enough values of k. 

At this point, there should be a suspicion that we will not achieve much 
with this exponential change of measure method (ECM method short) 
since the forms oi pk and qiis) presuppose the knowledge of Wn{k) and 
in turn A(A;). We know from the GE Theorem that X{k) is in many cases 



^^In statistics and actuarial matliematics, pk is known as tiie associated law or Esscher 
transform of p. 
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sufficient to obtain /(s), so why taking the trouble of sampling QLis) to get 

The answer to this question is that I{s) can be estimated from pk without 
knowing Wn{k) using two insights: 

• Estimate I{s) indirectly by sampling an estimator of X{k) or of Sn, 
which does not involve Wn{k), instead of sampling the estimator qL{s) 
of PS As); 

• Sample u according to the a priori pdf p{uj) or use the Metropolis 
algorithm, also known as the Markov chain Monte Carlo algorithm, to 
sample u according to Pk{^) without knowing Wn{k) (see Appendix [A| 
and the contribution of Katzgraber in this volume) . 



These points will be explained in the next section (see also Exercise 4.7.11). 

For the rest of this subsection, we will illustrate the ECM method for a 
simple Gaussian IID sample mean, leaving aside the issue of having to know 
Wn{k). For this process 

n 

Pk{uj) = Pkixi, ...,Xn) = Y\_Pk{Xi), (4.18) 

i=l 

where 

Pkix^) = W{k) = E.ie"''], (4.19) 

with p{xi) a Gaussian pdf with mean /i and variance a^. We know from 
Exercise 3.6.1 the expression of the generating function W{k) for this pdf: 

W{k) = e''f'+'^''\ (4.20) 
The explicit expression of pfc(xj) is therefore 

Pkixi) = e'^^^^-w-^fe ; = —= . (4.21) 

The estimated rate function obtained by sampling the XiS according to 



this Gaussian pdf with mean ^ + a k and variance a is shown in Fig. 4.2 



for various values of n and L. This figure should be compared with Fig. 4.1 



which illustrates the naive sampling method for the same sample mean. It is 
obvious from these two figures that the ECM method is more efficient than 
naive sampling. 



To make matters worst, the sampling of (Jl(s) does involve /(s) directly; see Exer- 
1 14.7.101 
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Figure 4.2: IS for the Gaussian IID sample mean {fj, = a = 1) with the 
exponential change of measure. 



As noted before, the rate function can be obtained in a parametric way 
by "scanning" k instead of "scanning" s and fixing k according to s. If we 
were to determine k{s) for a given s, we would find 

X'(k) = fj, + a'^k = s, (4.22) 

so that k{s) = (s — /u)/cj^. Substituting this result in Pk{xi), then yields 

Pk{s){xi) = — ^T^^p — • (^-2^) 

Thus to efficiently sample J3s„(s), we must sample the Xj's according to a 
Gaussian pdf with mean s instead of the a priori pdf with mean ^u. The 
sampling is efficient in this case simply because Sn concentrates in the sense 
of the LLN at the value s instead of fi, which means that s has become the 
typical value of Sn under Pk(s)- 

The idea of the ECM method is the same for processes other than IID 
sample means: the goal in general is to change the sampling pdf from p to pk 
in such a way that an event that is a rare event under p becomes a typical 



under Pk "^^ For more information on the ECM method and IS in general; 
see [4J and Chaps. V and VI of [2]. For applications of these methods, see 
Exercise 14.7.91 



4.4 Applications to Markov chains 

The application of the ECM method to Markov chains follows the IID case 

(?) (7) 

closely: we generate L IID realizations , . . . ,Xn , j = 1, . . . ,L, of the 

■^^Note that this change is not always possible: for certain processes, there is no k that 
makes certain events typical under pt- For examples, see jl]. 
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chain u = Xi, . . . , X„ according to the tilted joint pdf Pfe(cj), using either 
Pk directly or a Metropolis-type algorithm. The value k{s) that achieves 
the efficient sampling of Sn at the value s is determined in the same way as 



in the IID case by solving A'(A;) = s, where X{k) is now given by Eq. (3.9). 
In this case, Sn = s becomes the typical event of Sn in the limit n — )• oo. 
A simple application of this procedure is proposed in Exercise |4.7.14 For 



further reading on the sampling of Markov chains, see [U H2t HSj EJ I41j . 

An important difference to notice between the IID and Markov cases is 
that the ECM does not preserve the factorization structure of p{u:) for the 
latter case. In other words, Pki^) does not describe a Markov chain, though, 
as we noticed, Pk{<^) is a product pdf when p{u) is itself a product pdf. In 
the case of Markov chains, Pk{<^) does retain a product structure that looks 
like a Markov chain, and one may be tempted to define a tilted transition 
pdf as 

T^k{x'\x) = ^^'IJ!^?^^ , W{k\x)= [ e^''\{x'\x)dx'. (4.24) 



W{k\x) 

However, it is easy to see that the joint pdf of cj obtained from this transition 
matrix does not reproduce the tilted pdf Pk{<^)- 



4.5 Applications to continuous-time Markov processes 

The generalization of the results of the previous subsection to continuous-time 
Markov processes follows from our discussion of the continuous-time limit 



of Markov chains (see Sec. 3.2). In this case, we generate, according to the 
tilted pdf of the process, a sample of time-discretized trajectories. The choice 
of At and n used in the discretization will of course influence the precision of 
the estimates of the pdf and rate function, in addition to the sample size L. 

Similarly to Markov chains, the tilted pdf of a continuous-time Markov 
process does not in general describe a new "tilted" Markov process having a 
generator related to the tilted generator Gk defined earlier. In some cases, 
the tilting of a Markov process can be seen as generating a new Markov 
process, as will be discussed in the next subsection, but this is not true in 
general. 



4.6 Applications to stochastic differential equations 

Stochastic differential equations (SDEs) are covered by the continuous-time 
results of the previous subsection. However, for this type of stochastic 
processes the ECM can be expressed more explicitly in terms of the path 
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pdf p[x] introduced in Sec. 3.4 The tilted version of this pdf, which is the 
functional analogue of pk{uj), is written as 



where St[x] is some functional of the trajectory {x(t)}^o 

Wrik) = Ep[e^''^^]. (4.26) 
The likelihood ratio associated with this change of pdf is 

= = ^-TkSrM wrik) ^ e-^[kSTM-m]_ (4.27) 

Pkm 

As a simple application of these expressions, let us consider the SDE 

x{t)=m, (4.28) 
where S,{t) is a Gaussian white noise, and the following additive process: 



1 



T 



Dt[x] = - x{t)dt, (4.29) 

which represents the average velocity or drift of the SDE. What is the form 
of in this case? Moreover, how can we use this tilted pdf to estimate 
the rate function 1(d) of Dj- in the long-time limit? 

An answer to the first question is suggested by recalling the goal of the 
ECM. The LLN implies for the SDE above that Dt — >• in probability as 
T — )• oo, which means that Dt = is the typical event of the "natural" 
dynamic of x{t). The tilted dynamics realized by should change this 

typical event to the fluctuation Dt = d; that is, the typical event of Dt 
under should be Dt = d rather than Dt = 0. One candidate SDE 

which leads to this behavior of Dt is 

x{t) = d + ^{t), (4.30) 

so the obvious guess is that Pk{d) i^] is the path pdf of this SDE. 

Let us show that this is a correct guess. The form of Pk[x] according to 
Eq. ( [4251 ) is 

Pfcix] P.e-"'=W, Ik[x] = I[x]-kDT[x] + X{k), (4.31) 
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where 



r 

2T 



I[x] = ^ I x'^dt (4.32) 







is the action of our original SDE and 



X{k)= hm ;^ln^[e^*^^T] (4.33) 

is the SCGF of Dt- The expectation entering in the definition of X(k) can 
be calculated explicitly using path integrals with the result A(A;) = k'^ /2. 
Therefore, 

Ik[x] = I[x]-kDT[x]-^. (4.34) 

From this result, we find Pj!c((i)[x] by solving X'{k(d)) = d, which in this 
case simply yields k{d) = d. Hence 

4(d) N = I[x] - dDrix] + y = ^ j^i " df dt, (4.35) 



which is exactly the action of the boosted SDE of Eq. (4.30). 

Since we have A(A:), we can of course directly obtain the rate function 
I{d) of Dt by Legendre transform: 

d"^ 

I{d) = k{d)d - \{k{d)) = —. (4.36) 
This shows that the fluctuations of Dt are Gaussian, which is not surprising 



given that Dt is up to a factor 1/T a Brownian motion Note that the 



same result can be obtained following Exercise 4.7.10 by calculating the 
likelihood ratio R[x] for a path {xd{t)}f^Q such that DtIxj] = d, i.e., 

R[x,] = ^ e-^'^'\ (4.37) 

These calculations give a representative illustration of the ECM method 
and the idea that the large deviations of an SDE can be obtained by replacing 
this SDE by a boosted or effective SDE whose typical events are large 
deviations of the former SDE. Exercises 14.7.161 and 14.7.171 show that effective 
SDEs need not correspond exactly to an ECM, but may in some cases 
reproduce such a change of measure in the large deviation limit T — t- cxd. 



^"Brownian motion is defined as the integral of Gaussian white noise. By writing x = £,{t), 
we therefore define x{t) to be a Brownian motion. 
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The important point in all cases is to be able to calculate the likelihood 
ratio between a given SDE and a boosted version of it in order to obtain the 
desired rate function in the limit T — )• ooP^ 

For the particular drift process Dt studied above, the likelihood ratio 
happens to be equivalent to a mathematical result known as Girsanov's 



formula, which is often used in financial mathematics; see Exercise 4.7.15 



4.7 Exercises 

4.7.1. [^1 (Variance of Bernoulh RVs) Let X be a Bernoulh RV with P{X = 
1) = Q and P{X = 0) = 1 — a, < Q < 1. Find the variance of X in terms 
of a. 

4. 7. 2. (Direct sampling of IID sample means) Generate on a computer a 
sample of size L = 100 of n IID Bernoulli RVs Xi, . . . , Xn with bias a = 0.5, 
and numerically estimate the rate function I{s) associated with the sample 
mean Sn of these RVs using the naive estimator pl{s) and the finite-size 



rate function In,L{s) defined in Eq. (4.6). Repeat for L = 10^, 10^, 10^, and 
10^ and observe how In,L{s) converges to the exact rate function given in 



Eq. ( 2.10[ ). Repeat for exponential RVs. 



4.7.3.[-'-°] (Unbiased estimator) Prove that qL{s) is an unbiased estimator of 



P5„(s), i-e., prove Eq. (4.9). 



4.7.4.[-'-^] (Estimator variance) The estimator pl{s) is actually a sample mean 
of Bernoulli RVs. Based on this, calculate the variance of this estimator with 



respect to p{u^) using the result of Exercise |4.7.1 above. Then calculate the 



variance of the importance estimator ^l(s) with respect to q{ijj)- 

4.7.5.[-'-^] (Relative error of estimators) The real measure of the quality of 
the estimator pl{s) is not so much its variance Yax{pL{s)) but its relative 
error err(^i(s)) defined by 

i-^ I ^^ \/var(^L(s)) 

err PL s) = . 4.38) 

PsAs) 

Use the previous exercise to find an approximation of err(pi(s)) in terms of 
L and n. From this result, show that the sample size L needed to achieve a 
certain relative error grows exponentially with n. (Source: Sec. V.I of ^.) 



^^The limit e — > can also be considered. 
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4.7.6^^^ (Optimal importance pdf) Show that the density q* defined in 



(4.11) is optimal in the sense that 

Yaiqiqiis)) > Yaiq*{qL{s)) = (4.39) 
for all q relatively continuous to p. (See Chap. V of [2j for help.) 
4.7.7^^^ (Exponential change of measure) Show that the value of k solving 



Eq. (4.16) is given by \'{k) = s, where A(A;) is the SCGF defined in Eq. (4.3) 



4. 7. 8.'^''] (LDP for estimators) We noticed before that pl{s) is the empirical 
vector of the IID sample {Sn^}j'^i. Based on this, we could study the LDP 
of pl(s) rather than just its mean and variance, as usually done in sampling 
theory. What is the form of the LDP associated with pl{s) with respect to 
p{uj)7 What is its rate function? What is the minimum and zero of that 
rate function? Answer the same questions for ^l(s) with respect to g(cj). 



(Hint: Use Sanov's Theorem and the results of Exercise [3.6.6 



4.7.9.[^^] (Importance sampling of IID sample means) Implement the ECM 
method to numerically estimate the rate function associated with the IID 



sample means of Exercise 3.6.1 Use a simple IID sampling based on the 
explicit expression of pk in each case (i.e., assume that W(k) is known). 
Study the convergence of the results as a function of n and L as in Fig. |4.2[ 

4.7.10.['^^] (IS estimator) Show that qiis) for Pk(s) has the form 

1 ^ 

QLis) « ^ E 1a.(^^^'^) e-"^(^'^'). (4.40) 
i=i 

Why is this estimator trivial for estimating I{s)? 



4.7.9 



4.7.11.[^°] (Metropolis sampling of IID sample means) Repeat Exercise 
but instead of using an IID sampling method, use the Metropolis algorithm 
to sample the Xj's in Sn- (For information on the Metropolis algorithm, see 
Appendix [a}) 

4.7.12J-'-^] (Tilted distribution from contraction) Let Xi, . . . ,Xn be a se- 
quence of real IID RVs with common pdf p{x), and denote by 5„ and Ln{x) 
the sample mean and empirical functional, respectively, of these RVs. Show 
that the "value" of Ln{x) solving the minimization problem associated with 
the contraction of the LDP of Ln{x) down to the LDP of Sn is given by 

/^.(x) = ^J^, W{k) = E,[e'^''], (4.41) 
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where k is such that X'{k) = (InVF(fc))' = s. To be more precise, show that 
this pdf is the solution of the minimization problem 



where 



inf I(^), (4.42) 



= [ dx^(x)ln44 (4-43) 



is the continuous analogue of the relative entropy defined in Eq. (3.7) and 



f{fi) = / x^j.{x)dx (4.44) 
Jr 

is the contraction function. Assume that W{k) exists. Why is //^ the same 
as the tilted pdf used in IS? 

4.7.13. [^°] (Equivalence of ensembles) Comment on the meaning of the 
following statements: (i) The optimal pdf q* is to the microcanonical ensemble 
what the tilted pdf pk is to the canonical ensemble, (ii) Proving that pk 
achieves a zero variance for the estimator qL{s) in the limit n — )• oo is the 
same as proving that the canonical ensemble becomes equivalent to the 
microcanonical ensemble in the thermodynamic limit. 

4.7.14. [^^] (Tilted Bernoulli Markov chain) Find the expression of the tilted 



joint pdf pk{<j^) for the Bernoulli Markov chain of Exercise 3.6.10 Then use 
the Metropolis algorithm (see Appendix [A]) to generate a large-enough sample 
of realizations of u in order to estimate pSnis) and I{s). Study the converge 
of the estimation of I{s) in terms of n and L towards the expression of I{s) 
found in Exercise 3.6. 10[ Note that for a starting configuration xi, . . . ,Xn 



and a target configuration x'l, . . . ,x'^, the acceptance ratio is 



;) _ efe<p(^;)nr=2e^"'^(^^k^.-i) 

Pkixi, ...,Xn) e'=^ip(xi) nr=2 e''^»7r(a;i|xi_i) 



Pky^l^ ■ ■ ■ ,X 



(4.45) 



where p{xi) is some initial pdf for the first state of the Markov chain. What 
is the form of ^l(s) in this case? 

4.7.15.'^°^ (Girsanov's formula) Denote by p[x] the path pdf associated with 
the SDE 

x{t) = ^em, (4.46) 
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where ^(t) is Gaussian white noise with unit noise power. Moreover, let q[x] 
be the path pdf of the boosted SDE 



(4.47) 



Show that 



R\x] 



p[x\ 
q[x] 



exp 



e Jo 2e 



exp 



4.7.16.' J (Effective SDE) Consider the additive process, 



(4.48) 



(4.49) 



with x(t) evolving according to the simple Langevin equation of Exer- 
Show that the tilted path pdf Pk{d)[x] associated with this 



cise 



3.6.14 



process is asymptotically equal to the path pdf of the following boosted SDE: 



x(t) = -7(x(t)-s) + e(t). 



(4.50) 



More precisely, show that the action of Pfc(s)[a;] and the action J[x] 

of the boosted SDE differ by a boundary term which is of order 0(1/T). 

4. 7.1 7. (Non-exponential change of measure) Although the path pdf 



of the boosted SDE of Eq. (4.50) is not exactly the tilted path pdf Pk[x] 



obtained from the ECM, the likelihood ratio R[x] of these two pdfs does 
yield the rate function of St- Verify this by obtaining R[x] exactly and by 



using the result of Exercise 4.7.10 



5 Other numerical methods for large deviations 

The use of the ECM method appears to be limited, as mentioned before, by 
the fact that the tilted pdf Pk{<^) involves Wn{k). We show in this section 
how to circumvent this problem by considering estimators of Wn{k) and X{k) 
instead of estimators of the pdf p(5n), and by sampling these new estimators 
according to the tilted pdf pfc(a;), but with the Metropolis algorithm in order 
not to rely on Wn{k), or directly according to the a priori pdi p{u:). At the 
end of the section, we also briefly describe other important methods used in 
numerical simulations of large deviations. 
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5.1 Sample mean method 

We noted in the previous subsection that the parameter k entering in the 
ECM had to be chosen by solving A' (A;) = s in order for s to be the typical 
event of Sn under pk- The reason for this is that 

lim Ep^[Sn] = X'ik). (5.1) 

n— >cx) 

By the LLN we also have that Sn — ?■ A'(A;) in probability as n — ?■ oo if S'n is 
sampled according to pk- 

These results suggest using 

1 ^ 

SL,n(A;) = (5.2) 
i=i 

as an estimator of X'{k) by sampling u according to pk{uj), and then inte- 
grating this estimator with the boundary condition A(0) = to obtain an 
estimate of X{k). In other words, we can take our estimator for X{k) to be 

XL,n{k)= [ SL,n{k')dk'. (5.3) 
JO 

From this estimator, we then obtain a parametric estimation of I{s) from 
the GE Theorem by Legendre-transforming Xi^nik)'- 

lL,n{s) =ks- XL,n{k), (5.4) 

where s = SL,n{k) or, alternatively, 

lL,n{^)=k{s)s-XL,n{k{s)), (5.5) 

where k{s) is the root of X'i^{k) = s,^ 

The implementation of these estimators with the Metropolis algorithm is 
done explicitly via the following steps: 

1. For a given A; G M, generate an IID sample {cj^-'^jj"^]^ of L configurations 
cj = xi, . . . , a;„ distributed according to Pk{<^) using the Metropolis 
algorithm, which only requires the computation of the ratio 

Pk{uj') e"'=^"('^')p(tj') ^ • 

for any two configurations u and u'; see Appendix \A\ 

^^In practice, we have to check numerically (using finite-scale analysis) that Xn,L{k) 
does not develop non-differentiable points in fc as n and L axe increased to oo. If non- 
differentiable points arise, then I(s) is recovered from the GE Theorem only in the range 
of A'; see Sec. 4.4 of [48]. 
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Figure 5.1: (Top left) Empirical sample mean SL{k) for the Gaussian IID 
sample mean {fj, = a = 1) . The spacing of the k values is Ak = 0.5. (Top 
right) SCGF Xiik) obtained by integrating SL{k). (Bottom) Estimate of 
/(s) obtained from the Legendre transform of \L{k). The dashed line in 
each figure is the exact result. Note that the estimate of the rate function 
for L = 5 is multi- valued because the corresponding Al(A:) is nonconvex. 

2. Compute the estimator SL,n{k) of A'(A;) for the generated sample; 

3. Repeat the two previous steps for other values of k to obtain a numerical 
approximation of the function A' (A;); 

4. Numerically integrate the approximation of A'(A;) over the mesh of 
k values starting from /c = 0. The result of the integration is the 
estimator \L,n{k) of A(A;); 



5. Numerically compute the Legendre transform, Eq. ( |5.4[ ), of the estima- 
tor of X{k) to obtain the estimator lL,n{s) of I[s) at s = SL,n{k)\ 

6. Repeat the previous steps for larger values of n and L until the results 
converge to some desired level of accuracy. 
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The result of these steps are ihustrated in Fig. 5.1 for our test case of the 
IID Gaussian sample mean for which X'{k) = fJ- + a'^k. Exercise 5.4.1 



covers 



other sample means studied before (e.g., exponential, binary, etc.). 

Note that for IID sample means, one does not need to generate L real- 
izations of the whole sequence u, but only L realizations of one summand 
Xi according the tilted marginal pdf Pk{x) = e^^p{x)/W{k). In this case, 
L takes the role of the large deviation parameter n. For Markov chains, 
realizations of the whole sequence cj must be generated one after the other, 
e.g., using the Metropolis algorithm. 



5.2 Empirical generating functions 

The last method that we cover is based on the estimation of the generating 
function 

Wn{k) = E[Q''^^"]= I p(cj)e"'=^"('^)dLJ. (5.7) 



Consider first the case where Sn is an IID sample mean. Then \{k) = In W{k), 
where W{k) = E[e^-^^], as we know from Sec. 3.1 , and so the problem reduces 



to the estimation of the SCGF of the common pdf p{X) of the Xj's. An 
obvious estimator for this function is 



Xdk) = In 



1 ^ 



(5. 



where X'^^^ are the IID samples drawn from p{X). From this estimator, 
which we call the empirical SCGF, we build a parametric estimator of the 



rate function I{s) of Sn using the Legendre transform of Eq. (5.4) with 



s{k) = X'^ik) 



(5.9) 



Fig. 5.2 shows the result of these estimators for the IID Gaussian sample 
mean. As can be seen, the convergence of /l(s) is fast in this case, but is 
limited to the central part of the rate function. This illustrates one limitation 
of the empirical SCGF method: for unbounded RVs, such as the Gaussian 
sample mean, W{k) is correctly recovered for large |A:| only for large samples, 
i.e., large L, which implies that the tails of I(s) are also recovered only for 
large L. For bounded RVs, such as the Bernoulli sample mean, this limitation 



does not arise; see [HI [l5] and Exercise 5.4.2 



The application of the empirical SCGF method to sample means of 
ergodic Markov chains is relatively direct. In this case, although Wn no 
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Figure 5.2: (Top left) Empirical SCGF Xiik) for the Gaussian IID sample 
mean = a = 1). (Top right) s{k) = X'^{k). (Bottom) Estimate of I{s) 
obtained from \L{k). The dashed line in each figure is the exact result. 



longer factorizes into W{k)'^^ it is possible to "group" the Xj's into blocks of 
h RVs as follows: 

Xi + • • • + Xft + Xb+i + • • • + X2b + • • • + Xn-b+l + ---+Xn (5.10) 

' V ' ^ V ' ^ V ' 

n Y2 Ym 

where m = n/6, so as to rewrite the sample 

^ n ^ ra 

Sn = -y^Xi = --y^Yi. (5.11) 

n ^-^ bra ^-^ 

i=\ i=l 

If b is large enough, then the blocks Yi can be treated as being independent. 
Moreover, if the Markov chain is ergodic, then the l^'s should be identically 
distributed for i large enough. As a result, Wn{k) can be approximated for 
large n and large b (but 6 ^ n) as 

Wn{k) = ^[e'^^-i^'] ^ E[e''^^r, (5.12) 
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so that 

Xn{k) = -lnWnik) ^-\nE[e''^^] = hnE[e''^^]. (5.13) 
n n b 

We are thus back to our original IID problem of estimating a generating 

function; the only difference is that we must perform the estimation at the 

level of the 1^'s instead of the Xj's. This means that our estimator for X{k) 

is now 

AL,n(fc) = J In (5-14) 

where Y^^\ j = 1, . . . , L are IID samples of the block RVs. Following the IID 
case, we then obtain an estimate of I{s) in the usual way using the Legendre 



transform of Eq. (5.4) or (5.5). 



The application of these results to Markov chains and SDEs is covered 



in Exercises 5.4.2 and 5.4.3 For more information on the empirical SCGF 



method, including a more detailed analysis of its convergence, see [HI [15]. 
5.3 Other methods 

We briefly describe below a number of other important methods used for 
estimating large deviation probabilities, and give for each of them some 
pointer references for interested readers. The fundamental ideas and concepts 
related to sampling and ECM that we have treated before also arise in these 
other methods. 

• Splitting methods: Splitting or cloning methods are the most used 
sampling methods after Monte Carlo algorithms based on the ECM. 
The idea of splitting is to "grow" a sample or "population" {u} of 
sequences or configurations in such a way that a given configuration 
uj appears in the population in proportion to its weight pk{u). This 
is done in some iterative fashion by "cloning" and "killing" some 
configurations. The result of this process is essentially an estimate 
for Wn{k)^ which is then used to obtain I{s) by Legendre transform. 
For mathematically-oriented introductions to splitting methods, see 
, ISl 13 EZ] ; for more physical introductions, see [2311321 liGlliT] . 



Optimal path method: We briefly mentioned in Sec. 3.4 that rate 
functions related to SDEs can often be derived by contraction of the 
action functional I[x\. The optimization problem that results from 
this contraction is often not solvable analytically, but there are several 
techniques stemming from optimization theory, classical mechanics. 
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and dynamic programming that can be used to solve it. A typical 
optimization problem in this context is to find the optimal path that a 
noisy system follows with highest probability to reach a certain state 
considered to be a rare event or fluctuation. This optimal path is also 
called a transition path or a reactive trajectory, and can be found 
analytically only in certain special systems, such as linear SDEs and 
conservative systems; see Exercises |5.4.4 - 5.4.6 For nonlinear systems, 



the Lagrangian or Hamiltonian equations governing the dynamics of 
these paths must be simulated directly; see [26] and Sec. 6.1 of [l8] for 
more details. 

Transition path method: The transition path method is essentially 
a Monte Carlo sampling method in the space of configurations u (for 
discrete-time processes) or trajectories {x(t)}^o (fo'^ continuous-time 
processes and SDEs), which aims at finding, as the name suggests, 
transition paths or optimal paths. The contribution of Dellago in this 
volume covers this method in detail; see also [HI [121 HOl IMl E]. A 
variant of the transition method is the so-called string method \VT\ 
which evolves paths in conservative systems towards optimal paths 
connecting different local minima of the potential or landscape function 
of these systems. 

Eigenvalue method: We have noted that the SCGF of an ergodic 
continuous-time process is given by the dominant eigenvalue of its 
tilted generator. From this connection, one can attempt to numerically 
evaluate SCGFs using numerical approximation methods for linear 
functional operators, combined with numerical algorithms for finding 
eigenvalues. A recent application of this approach, based on the 
renormalization group, can be found in 



5.4 Exercises 

5.4.1. [^^] (Sample mean method) Implement the steps of the sample mean 
method to numerically estimate the rate functions of all the IID sample 
means studied in these notes, including the sample mean of the Bernoulli 
Markov chain. 

5.4.2. ["^^1 (Empirical SCGF method) Repeat the previous exercise using 
the empirical SCGF method instead of the sample mean method. Explain 
why the former method converges quickly when 5„ is bounded, e.g., in the 
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Bernoulli case. Analyze in detail how the method converges when Sn is 
unbounded, e.g., for the exponential case. (Source: OUnj.) 

5.4.3. [^^1 (Sampling of SDEs) Generalize the sample mean and empirical 
mean methods to SDEs. Study the additive process of Exercise 4.7.16| as a 
test case. 

5.4.4. (Optimal paths for conservative systems) Consider the conservative 
system of Exercise |3.6.15 and assume, without loss of generality, that the 



global minimum of the potential C/(x) is at x = 0. Show that the optimal 
path which brings the system from x(0) = to some fluctuation x(r) = x / 
after a time T is the time-reversal of the solution of the noiseless dynamics 
X = — VC/(x) under the terminal conditions x(0) = x and x(T) = 0. The 
latter path is often called the decay path. 

5.4.5. [^^1 (Optimal path for additive processes) Consider the additive process 
St of Exercise |4.7. 16| under the Langevin dynamics of Exercise 3.6.14 Show 



that the optimal path leading to the fluctuation St = s is the constant path 

x{t) = s, tG[0,T]. (5.15) 
Explain why this result is the same as the asymptotic solution of the boosted 



SDE of Eq. (4.50). 



5.4.6. [^^1 (Optimal path for the dragged Brownian particle) Apply the 



previous exercise to Exercise 3.6.18, Discuss the physical interpretation of 
the results. 
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A Metropolis-Hastings algorithm 

The Metropolis-Hastings algorithm]^ or Markov chain Monte Carlo 
algorithm is a simple method used to generate a sequence xi, . . . , x„ of 
variates whose empirical distribution L„(x) converges asymptotically to a 

^^Or, more precisely, the Metropolis-Rosenbluth-Rosenbluth-Teller-Teller-Hastings 
algorithm. 



50 



target pdf p{x) as n — )■ oo. The variates are generated sequentially in the 
manner of a Markov chain as follows: 



1. Given the value Xi = x, choose another value x' , called the proposal 
or move, according to some (fixed) conditional pdf q{x'\x), called the 
proposal pdf; 

2. Accept the move x' , i.e., set Xj+i = x' with probability min{l, a} where 

p{x') q(x\x') 



p{x) q(x'\x) ' 
3. If the move is not accepted, set Xj+i = x. 



(A.l) 



This algorithm is very practical for sampling pdfs having an exponential 
form, such as the Boltzmann-Gibbs distribution or the tilted pdf p^, because 
it requires only the computation of the acceptance ratio a for two values x 
and x' , and so does not require the knowledge of the normalization constant 
that usually enters in these pdfs. For the tilted pdf pk{uj), for example, 

Pk{u')q{u:\u:') ^ e"^^"(") p(^) 
pkiu:)q{LLi'\u) e"-''^"('^^p{u)q{u'\u)' 

The original Metropolis algorithm commonly used in statistical physics 
is obtained by choosing q{x'\x) to be symmetric, i.e., if q{x'\x) = q{x\x') for 
all X and x' , in which case a = p{x')/p{x). If q{x'\x) does not depend on x, the 
algorithm is referred to as the independent chain Metropolis-Hastings 
algorithm. 

For background information on Monte Carlo algorithms, applications, 
and technical issues such as mixing, correlation and convergence, see [31j 
and Chap. 13 of [2]. 
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Errata 



I list below some errors contained in the published version of these notes and 
in the versions (vl and v2) previously posted on the arxiv. 

The numbering of the equations in the published version is different from 
the version posted on the arxiv. The corrections listed here refer to the arxiv 
version. 

I thank Bernhard Altaner and Andreas Engel for reporting errors. I 
also thank Vivien Lecomte for a discussion that led me to spot the errors of 
Sections 4.4 and 4.5. 

Corrections to arxiv version 1 

• p. 15, in Eq. (2.35): The correct result is a^"-{l — a)^^~^^". [Noted by 
B. Altaner] 

• p. 16, in Ex. 2.7.8: "A(/c) has a continuous derivative" added. 

• p. 20, in Eq. (3.15): The limit T^O should be T ^ oo. [Noted by B. 

Altaner] 

• p. 25, in Ex. 3.6.2: Y G {-1, +1} with P{Y = 1) = P{Y = -1) = 1/2 

instead of Bernoulli. 

• p. 25, in Ex. 3.6.5: The RVs are now specified to be Gaussian RVs. 

Corrections to arxiv version 2 

• p. 1: In the abstract: missing 's' to 'introduce'. 

• p. 16, in Ex. 2.7.8: The conditions of X{k) stated in the exercise are 
again wrong: A(fc) must be differentiable and must not have affine 
parts to ensure that A'(A;) = s has a unique solution. 

• p. 20, in Eq. (3.17): The tilted generator is missing an identity matrix; 
it should read 

Gk = G + kx'S^^x'. 

[Noted by A. Engel] Part of Section 3.3 and Eq. (3.17) was rewritten 
to emphasize where this identity matrix term is coming from. 

• p. 27, in Eq. (3.40): The generator should have the off diagonal elements 
exchanged. Idem for the matrix shown in Eq. (3.45). [Noted by A. 
Engel] 
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These notes represent probability vectors as columns vectors. Therefore, 
the columns, not the rows, of stochastic matrices must sum to one, 
which means that the columns, not the rows, of stochastic generators 
must sum to zero. 

p. 27, in Ex. 3.6.13: It should be mentioned that f{xi,Xi-^i) is chosen 
as f{xi,Xi+i) = 1 - 6x„x,+i- [Noted by A. Engel] 

p. 36, in Section 4.4: Although the tilted pdf Pki^) of a Markov chain 
has a product structure reminiscent of a Markov chain, it cannot be 
described as a Markov chain, contrary to what is written. In particular, 
the matrix shown in Eq. (4.24) is not a proper stochastic matrix with 
columns summing to one. For that, we should define the matrix 

nk{x'\x) = W{k\x) = J e^-'7r{x'\x) dx' . 

However, the pdf of u) constructed from this transition matrix is 
different from the tilted joint pdf pfe(a;). Section 4.4 was rewritten to 
correct this error. 

This section is not entirely wrong: the Metropolis sampling can still 
be done at the level of pk by generating IID sequences uj^^\ as now 
described. 

p. 37, in Section 4.5: The error of Section 4.4 is repeated here at 
the level of generators: the Gk shown in Eq. (4.25) is not a proper 
stochastic generator with "columns" summing to zero. Section 4.5 was 
completely rewritten to correct this error. 

p. 42, Ex. 4.7.14: This exercise refers to the tilted joint pdf pk{uj); 
there is no tilted transition matrix here. The exercise was merged with 
the next one to correct this error. 
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